Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values

Authors

  • Pedro J. García-Laencina
  • Pedro Abreu
  • Miguel Henriques Abreu
  • Noémia Afonoso
Abstract

Breast cancer is the most frequently diagnosed cancer in women. Using historical patient information stored in clinical datasets, data mining and machine learning approaches can be applied to predict the survival of breast cancer patients. A common drawback is the absence of information, i.e., missing data, in certain clinical trials. Most standard prediction methods cannot handle incomplete samples, so missing data imputation is widely applied to address this problem. Therefore, taking into account the characteristics of each breast cancer dataset, a detailed analysis is required to determine the most appropriate imputation and prediction methods for each clinical environment. This work analyzes a real breast cancer dataset from the Portuguese Institute of Oncology of Porto with a high percentage of unknown categorical information (most of the patients' clinical data are incomplete), which makes the problem challenging. Four scenarios are evaluated: (I) 5-year survival prediction without imputation, and 5-year survival prediction from the cleaned dataset with (II) Mode imputation, (III) Expectation-Maximization imputation and (IV) K-Nearest Neighbors imputation. Prediction models for breast cancer survivability are constructed using four different methods: K-Nearest Neighbors, Classification Trees, Logistic Regression and Support Vector Machines. Experiments are performed in a nested ten-fold cross-validation procedure. According to the obtained results, the K-Nearest Neighbors algorithm performs best, with an accuracy above 81% and an area under the Receiver Operating Characteristic curve above 0.78, which are very good results in this complex scenario.
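
The abstract names Mode imputation and K-Nearest Neighbors imputation for unknown categorical values but does not show how they were implemented. The following is only a minimal sketch of those two ideas, not the authors' code. It assumes missing values are marked with `None`, that every column has at least one observed value, and that row similarity is measured with a Hamming-style mismatch rate over the features both rows have observed. The function names and the toy records are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for an unknown categorical value


def mode_impute(rows):
    """Replace each missing value with the most frequent observed value of its column."""
    modes = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        # Assumes at least one observed value per column.
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if v is MISSING else v for j, v in enumerate(r)] for r in rows]


def hamming(a, b):
    """Mismatch rate over the features that are observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not shared:
        return 1.0  # nothing to compare: treat as maximally distant
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(rows, k=3):
    """Fill each missing value by majority vote among the k nearest rows that observe it."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Candidate donors: other rows in which column j is observed.
            donors = [r for m, r in enumerate(rows) if m != i and r[j] is not MISSING]
            donors.sort(key=lambda r: hamming(row, r))
            votes = Counter(r[j] for r in donors[:k])
            filled[i][j] = votes.most_common(1)[0][0]
    return filled


# Hypothetical toy records: three categorical features per patient.
toy = [
    ["pos", "high", "yes"],
    ["pos", "high", None],
    ["neg", "low", "no"],
    ["neg", None, "no"],
    ["neg", "low", "no"],
    ["neg", "low", "no"],
]

by_mode = mode_impute(toy)
by_knn = knn_impute(toy, k=1)
```

On these toy records the two strategies disagree for the second row: mode imputation fills the third feature with the column-wide majority value, while the nearest-neighbor variant copies the value from the most similar complete row. This kind of difference is the reason the choice of imputation method has to be evaluated per dataset.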

Similar resources

The Association Between Diabetes Mellitus With Five-Year Survival Rate of Breast Cancer Patients

Background: Breast cancer is a highly prevalent malignancy leading to death across the world. However, patient survival is greatly affected by a diabetes diagnosis. The present study aimed to assess the association between diabetes mellitus and the five-year survival rate of breast cancer patients at cancer treatment centers. Methods: This retrospective follow-up study was conducte...

Maximum likelihood analysis of the logistic regression model when predictor variable data are incomplete but auxiliary variables are available

Background and Objectives: Missing data exist in many studies, e.g. in regression models, and they decrease the model's efficacy. Many methods have been suggested for handling incomplete data; they have generally focused on missing outcome values, but covariate values can also be missing. Materials and Methods: In this paper we study missing data imputation by the EM algorithm and auxiliary varia...

Missing data imputation using statistical and machine learning methods in a real breast cancer problem

OBJECTIVES Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. This work evaluates the performance of several statistical and machine learning imputation methods that were used to predict recurrence in patients in an extensive real breast cancer data set. MATERIALS AND METHODS Imputation methods based...

Survival analysis of breast cancer patients with different chronic diseases through parametric and semi-parametric approaches

Introduction: There is a lack of information on the extent of dependency between chronic diseases and the survival rate of breast cancer. To date, none of the proposed models has determined the impact of chronic diseases on breast cancer survival. This study, therefore, aimed to investigate the impacts of chronic diseases such as diabetes, blood pressure, and endocrine di...

Journal:
  • Computers in biology and medicine

Volume 59

Publication date: 2015